In [4]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler


from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml import Pipeline

Classification

Let’s go through an example of Credit Risk for Bank Loans:

What are we trying to predict?

  • Whether a person will pay back a loan or not.
  • This is the Label: The Creditability of a person.

What are the “if questions” or properties that you can use to predict?

  • An applicant’s demographic and socio-economic profile: occupation, age, savings, marital status, and so on.
  • These are the Features. To build a classifier model, you extract the features of interest that contribute most to the classification.
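As a minimal sketch of the label/feature split described above, the snippet below separates one applicant record into a feature vector and a label. The field names, values, and the 1.0/0.0 label encoding are hypothetical and chosen for illustration only; they are not taken from the dataset used in this notebook.

```python
# Hypothetical applicant record; field names and values are illustrative only.
applicant = {
    "creditability": 1.0,   # label (assumed encoding: 1.0 = pays back, 0.0 = does not)
    "age": 35.0,
    "savings": 2.0,
    "marital_status": 1.0,
    "occupation": 3.0,
}

LABEL_COLUMN = "creditability"
FEATURE_COLUMNS = ["age", "savings", "marital_status", "occupation"]


def split_features_and_label(record):
    """Return (features, label), with features ordered as in FEATURE_COLUMNS."""
    features = [record[name] for name in FEATURE_COLUMNS]
    label = record[LABEL_COLUMN]
    return features, label


features, label = split_features_and_label(applicant)
print(features, label)
```

In Spark ML, the `VectorAssembler` imported at the top of the notebook plays a similar role: it combines the chosen input columns into a single vector column.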

Decision trees

Decision trees create a model that predicts the class or label based on several input features. Decision trees work by evaluating an expression containing a feature at every node and selecting a branch to the next node based on the answer. A possible decision tree for predicting Credit Risk is shown below. The feature questions are the nodes, and the answers “yes” or “no” are the branches in the tree to the child nodes.

  • Q1: Is checking account balance > 200 DM?
    • No
    • Q2: Is length of current employment > 1 year?
      • No
      • Not Creditable
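The partial tree above can be read as nested conditions. The following is a rough sketch only: the function name and parameter names are hypothetical, and because the text shows just one path (both answers “no”), every other branch returns `None` rather than guessing an outcome.

```python
def classify_applicant(checking_balance_dm, employment_years):
    """Walk the partial tree from the text.

    Returns "Not Creditable" for the one path the text spells out,
    and None for branches the text does not show.
    """
    # Q1: Is checking account balance > 200 DM?
    if checking_balance_dm > 200:
        return None  # "yes" branch is not shown in the text
    # Q2: Is length of current employment > 1 year?
    if employment_years > 1:
        return None  # "yes" branch is not shown in the text
    # Both answers were "no".
    return "Not Creditable"


print(classify_applicant(150, 0.5))
```

Each `if` corresponds to a node in the tree, and each `return` to a leaf. A trained decision tree learns these thresholds from the data instead of having them written by hand.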

Random Forests

Ensemble learning algorithms combine multiple machine learning algorithms to obtain a better model. Random Forest is a popular ensemble learning method for classification and regression. The algorithm builds a model consisting of multiple decision trees, each based on a different subset of the data at the training stage. Predictions are made by combining the output from all of the trees, which reduces the variance and improves the predictive accuracy. For Random Forest classification, each tree’s prediction is counted as a vote for one class. The label is predicted to be the class that receives the most votes.
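The voting rule described above can be sketched in a few lines of plain Python. This is a simplified illustration of majority voting, not Spark's internal implementation; the function name and the sample predictions are made up for the example.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the most trees.

    Ties are resolved in favour of the label that appears first in the input.
    """
    votes = Counter(tree_predictions)
    label, _count = votes.most_common(1)[0]
    return label


# Hypothetical predictions from five trees for a single applicant.
tree_predictions = [1.0, 0.0, 1.0, 1.0, 0.0]
print(majority_vote(tree_predictions))  # prints 1.0
```

Here three of the five trees vote for class `1.0`, so that class becomes the forest's prediction even though two trees disagree.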

Analyze Credit Risk with Spark Machine Learning Scenario

This is the data we will be using

